Introduction:
Post-acute sequelae of COVID-19 (PASC), also known as Long COVID syndrome, is the persistence of COVID-19 symptoms long after viral infection (Lamontagne et al,2021). More than one year since its emergence, corona virus disease 2019 (COVID-19) is still looming large with a paucity of treatment options. To add to this burden, a sizeable subset of patients who have recovered from acute COVID-19 infection have reported lingering symptoms, leading to significant disability and impairment of their daily life activities. These patients are considered to suffer from what has been termed as “chronic” or “long” COVID-19 or a form of post-acute sequelae of COVID-19, and patients experiencing this syndrome have been termed COVID-19 long-haulers. Despite recovery from infection, the persistence of atypical chronic symptoms, including extreme fatigue, shortness of breath, joint pains, brain fogs, anxiety, and depression, that could last for months implies an underlying disease pathology that persist beyond the acute presentation of the disease (Ramakrishnan et al,2021)
The purpose of this study was to find out which sequelae can significantly impact physical health in infected individuals. In addition, this study explored whether medication use, and hospitalization could affect the health of infected individuals.
This study will obtain the data required for the study from SHARE data and perform data analysis by using R programming. This study will focus on the following sections: Data cleaning, EDA, Visualization, Modelling, and resampling method.
In the research, we selection variable as below:
Dependent Variable:
cah102_ Health: change in health in the last 3 months
Independent Variable:
cac120_1 COVID-19: fatigue attributed to respondent’s Covid illness
cac120_2 COVID-19: cough, congestion, shortness of breath attributed to respondent’s Covid
cac120_3 COVID-19: loss of taste or smell attributed to respondent’s Covid illness
cac120_4 COVID-19: headache attributed to respondent’s Covid illness
cac120_5 COVID-19: body aches, joint pain attributed to respondent’s Covid illness
cac120_6 COVID-19: chest or abdominal pain attributed to respondent’s Covid illness
cac120_7 COVID-19: diarrhea, nausea attributed to respondent’s Covid illness
cac120_8 COVID-19: confusion attributed to respondent’s Covid illness
At the same time, we will study the health changes and health status of the respondents (different place of residence, gender, country, age). Mainly use Crosstab and ggplot to display.
References:
Lamontagne, S. J., Winters, M. F., Pizzagalli, D. A., & Olmstead, M. C. (2021). Post-acute sequelae of COVID-19: evidence of mood & cognitive impairment. Brain, Behavior, & Immunity-Health, 17, 100347.
Ramakrishnan RK, Kashour T, Hamid Q, Halwani R and Tleyjeh IM (2021) Unraveling the Mystery Surrounding Post-Acute Sequelae of COVID-19. Front. Immunol. 12:686029. doi: 10.3389/fimmu.2021.686029
Data Cleaning
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(plotly) #Plot Package to provide interactivity to ggplot2 charts in our presentation
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
data <- as_tibble(read_csv("share8.csv"))
## Rows: 49253 Columns: 308
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (306): mergeid, hhid9ca, mergeidp9ca, coupleid9ca, country, language_ca,...
## dbl (2): exrate, cadn003_
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# head(data)
We have also converted the data dictionary as csv so that we were able to quickly lookup and identify the definitions of the column variables.
#Read Dictionary CSV and Clean Trim all the Columns
dictionary <- read_csv("DataDictionary/SHARE8DataDictionary-converted.csv")
## Rows: 309 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Varname, Label, Categories
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
dictionary$Varname <-str_trim(dictionary$Varname, side = c("both"))
dictionary$Label <-str_trim(dictionary$Label, side = c("both"))
dictionary$Categories <-str_trim(dictionary$Categories, side = c("both"))
# head(dictionary)
With importing the data dictionary into a tibble, we also created this custom function below to identify definition for our data.
# This is a function to lookup and retrieve details of a variable
Vardetails <- function(column,details){
tryCatch(
expr = {
as.character(column)
a <- colnames(data[,column])
},
error = function(e){
message("Vardetails Function: Column Not Found, Please Input a single valid variable name i.e.- 'mergeid'")
}
)
tryCatch(
expr = {
if(missing(details)) {
as.character(dictionary[match(colnames(data[,column]),dictionary$Varname),2])
} else {
if(details == 1){
as.character(dictionary[match(colnames(data[,column]),dictionary$Varname),2])
} else if (details == 2) {
as.character(dictionary[match(colnames(data[,column]),dictionary$Varname),3])
} else {
message("Vardetails Function: Error - Opt Arg: 1 for Label, 2 for Category ")
}
}
}
)
}
#Example Usage of the Var details function
Vardetails('mergeid')
## [1] "Person identifier (fix across modules and waves)"
Vardetails('cait104_',1)
## [1] "Social: use of internet for e-mailing, etc. since outbreak"
Vardetails('cait104_',2)
## [1] "Social"
GENERAL TIDYING :
Prior to starting off our exploratory data analysis process, we were able to identify specific rules and methods in order to clean the data in general. We will start off by 1) tidying missing codings and 2) Transforming Binary Values
Based on the share8 pdf documentation provided. Missing Codes are generally tagged as a negative coded value. ~page 16 of SHARE_release_guide_8-0-0.pdf.
With this notion, we are able to filter all negative coded value in the raw file first, to identify whether it is approapriate to be converted to “NA”.
#Identify the index for Coded Variables In The File
CodedVar <- c(9,11:308)
We will start by creating an empty tibble, next we looped through the coded variable and filter for unique negative coded values. The results are appended onto the empty tibble.
#Initiating an empty tibble
tbl_colnames = c('variable','value')
UniqueValues=as_tibble(matrix(nrow = 0, ncol = length(tbl_colnames)), .name_repair = ~ tbl_colnames)
UniqueValues=UniqueValues %>%
mutate(across(everything(), as.character))
UniqueValues %>% summarise_all(class)
## # A tibble: 1 x 2
## variable value
## <chr> <chr>
## 1 character character
#Looping Though Coded Variables
for (val in CodedVar)
{
#Identify the unique values in individual columns, subseqeuntly filter for negative codes with regular expression. "-\\d\\."
B <- unique(data[, val]) %>%
filter(grepl("-\\d\\.",!!as.symbol(colnames(data[, val]))))
#Pivot the data into 2 columns and unite to the main tibble
C <- pivot_longer(B, cols =colnames(B), names_to = "variable", values_to = "value")
UniqueValues <- union(UniqueValues,C)
}
Unique Negative Coded Values can be seeing in the tibble below. We only found 3 negative coded variables in the file
# head(UniqueValues)
unique(UniqueValues[, 'value'])
## # A tibble: 3 x 1
## value
## <chr>
## 1 -2. Refusal
## 2 -1. Don't know
## 3 -9. Not applicable (qqn routing)
It is appropriate to convert these negative coding to “NA”. We then assigned the negative codes identified to a “MissingCoding” List. Additionally, we have also identified other convention in the raw data that is indicating NA result, (‘Not applicable (qnn routing)’,‘99. Not applicable’) these values are also added to the MissingCoding List.
NegativeCoding <- unique(UniqueValues[,2])
MissingCoding <- c(NegativeCoding$value,'Not applicable (qnn routing)','99. Not applicable')
MissingCoding
## [1] "-2. Refusal" "-1. Don't know"
## [3] "-9. Not applicable (qqn routing)" "Not applicable (qnn routing)"
## [5] "99. Not applicable"
We will now 1) look through the coded variable and 2) convert any values that is in the missing coding list into “NA”
#Looping Though The Coded Variables
for (val in CodedVar)
{
#Replace Any Missing Coding to NA
for (x in MissingCoding)
{
data[, val][data[, val] == as.symbol(x)] <- NA
}
}
#Backup
# write.csv(data, file = "Share8MissingValuesCleaned.csv", row.names =FALSE)
# data <- as_tibble(read_csv("Share8MissingValuesCleaned.csv"))
# head(data)
We have transformed the negative coded and other applicable values into NA. Now we will proceed to step 2, which is the handling of binary values.
Prior to delving into the analysis, we can also identify and transform the columns containing only binary values into 1s and 0s. We can I identify binary values by starting with the function below.
#Function to identify Columns containing only 2 unique values after removing NAs
is.binary <- function(v) {
x <- length(na.omit(unique(v)))==2
}
#The function can be vapply on the dataset to return us TRUE and FALSE results for each columns
B <- vapply(data, is.binary, logical(1))
B
## mergeid hhid9ca mergeidp9ca coupleid9ca country language_ca
## FALSE FALSE FALSE FALSE FALSE FALSE
## exrate cadn042_ cadn002_ cadn003_ cas140_ caho100_
## FALSE TRUE FALSE FALSE FALSE TRUE
## caho037_ caho136_ caho032_ caph003_ cah102_ cah004_1
## FALSE FALSE FALSE FALSE FALSE TRUE
## cah004_2 cah004_3 cah004_4 cah004_5 cah004_6 cah004_7
## TRUE TRUE TRUE TRUE TRUE TRUE
## caph105_ caph089_1 caph089_2 caph089_3 caph089_4 cah006_
## FALSE TRUE TRUE TRUE TRUE TRUE
## cah007_1 cah007_2 cah007_3 cah007_4 cah007_5 cah007_6
## TRUE TRUE TRUE TRUE TRUE TRUE
## cah007_7 cah110_ cah111_3 cah111_6 cah111_7 cah111_8
## TRUE TRUE FALSE FALSE FALSE FALSE
## cah111_11 cah113_ cah116_ cac140_ cac142_ cac143_
## FALSE FALSE FALSE FALSE TRUE FALSE
## cah017_ cahc117_ cahc118_ cahc884_ cahc119_ cah020_
## TRUE TRUE FALSE TRUE TRUE TRUE
## cah121_1 cah121_2 camh002_ camh113_1 camh113_2 camh007_
## FALSE TRUE TRUE FALSE TRUE TRUE
## camh118_1 camh118_2 camh037_ camh148_ cac102_ cac103_1
## FALSE TRUE FALSE FALSE TRUE TRUE
## cac103_2 cac103_3 cac103_4 cac103_5 cac103_6 cac103_7
## TRUE TRUE TRUE TRUE TRUE TRUE
## cac103_8 cac103_97 cac103_3b cac103_4b cac103_5b cac103_6b
## TRUE TRUE FALSE FALSE FALSE FALSE
## cac103_7b cac103_8b cac103_97b cac104_ cac105_1 cac105_2
## FALSE FALSE FALSE TRUE TRUE TRUE
## cac105_3 cac105_4 cac105_5 cac105_6 cac105_7 cac105_8
## TRUE TRUE TRUE TRUE TRUE TRUE
## cac105_97 cac105_3b cac105_4b cac105_5b cac105_6b cac105_7b
## TRUE FALSE FALSE FALSE FALSE FALSE
## cac105_8b cac105_97b cac120_1 cac120_2 cac120_3 cac120_4
## FALSE FALSE TRUE TRUE TRUE TRUE
## cac120_5 cac120_6 cac120_7 cac120_8 cac120_97 cac120_98
## TRUE TRUE TRUE TRUE TRUE TRUE
## cac122_ cac130_ cac131_ cac110_ cac111_1 cac111_2
## TRUE FALSE TRUE TRUE TRUE TRUE
## cac111_3 cac111_4 cac111_5 cac111_6 cac111_7 cac111_8
## TRUE TRUE TRUE TRUE TRUE TRUE
## cac111_97 cac111_3b cac111_4b cac111_5b cac111_6b cac111_7b
## TRUE TRUE FALSE FALSE FALSE FALSE
## cac111_8b cac111_97b cac113_ cac114_2 cac114_3 cac114_4
## FALSE FALSE TRUE TRUE TRUE TRUE
## cac114_5 cac114_6 cac114_7 cac114_8 cac114_97 cac114_3b
## TRUE TRUE TRUE TRUE TRUE TRUE
## cac114_4b cac114_5b cac114_6b cac114_7b cac114_8b cac114_97b
## TRUE FALSE FALSE FALSE TRUE FALSE
## caq105_ caq106_1 caq106_2 caq106_3 caq106_4 caq106_97
## TRUE TRUE TRUE TRUE TRUE TRUE
## caq110_ caq111_1 caq111_2 caq111_3 caq111_4 caq111_97
## TRUE TRUE TRUE TRUE TRUE TRUE
## caq115_ caq116_1 caq116_2 caq116_3 caq116_4 caq116_97
## TRUE TRUE TRUE TRUE TRUE TRUE
## caq130_1 caq130_2 caq130_3 caq130_4 caq130_97 caq125_
## TRUE TRUE TRUE TRUE TRUE TRUE
## caq127_ caq128_1 caq128_2 caq128_3 caq128_4 caq128_5
## FALSE TRUE TRUE TRUE TRUE TRUE
## caq128_97 caq120_ caq121_ caq122_ caq123_1 caq123_2
## TRUE TRUE TRUE FALSE TRUE TRUE
## caq123_3 caq123_4 caq123_5 caq123_97 caq118_ caq119_
## TRUE TRUE TRUE TRUE FALSE FALSE
## caep005_ caw102_ caw103_ caep100_ caep101_2 caep101_1
## FALSE FALSE FALSE TRUE FALSE TRUE
## caep102_ caep103_ caw110_1 caw110_2 caw110_3 caw111_
## FALSE TRUE TRUE TRUE TRUE FALSE
## caw117_ caw121_ caw122_ caw123_1 caw123_2 caw123_3
## FALSE TRUE FALSE FALSE TRUE FALSE
## caw123_4 caw124_ caw125_ caw126_1 caw126_2 caw126_3
## TRUE TRUE FALSE FALSE TRUE FALSE
## caw126_4 cae103_ cae104_1 cae104_2 cae104_3 cae104_4
## TRUE TRUE TRUE TRUE TRUE TRUE
## cae104_97 cae114_1 cae114_2 cae114_3 cae114_4 cae001_
## TRUE TRUE TRUE TRUE TRUE TRUE
## cae100_ cae105e cae106_1 cae106_2 cae106_3 cae106_4
## TRUE FALSE FALSE TRUE FALSE TRUE
## cae107e cae108_1 cae108_2 cae108_3 cae108_4 cae109_1
## FALSE FALSE TRUE FALSE TRUE TRUE
## cae109_2 cae109_3 cae109_4 cae109_5 cae109_6 cae109_7
## TRUE TRUE TRUE TRUE TRUE TRUE
## cae109_8 cae109_98 casr006_ caco107_ cae111_ cae112_
## TRUE TRUE FALSE FALSE TRUE FALSE
## cae120_ cas103_1 cas103_5 cas103_2 cas103_3 cas103_4
## FALSE FALSE FALSE FALSE FALSE FALSE
## cas104_1 cas104_2 cas104_3 cas104_4 cas110_1 cas110_2
## FALSE FALSE FALSE FALSE TRUE TRUE
## cas110_3 cas110_4 cas111_1 cas111_2 cas111_3 cas111_4
## TRUE TRUE FALSE FALSE FALSE FALSE
## cas112_1 cas112_2 cas112_3 cas112_4 cas113_1 cas113_2
## TRUE TRUE TRUE TRUE FALSE FALSE
## cas113_3 cas113_4 cas115_ cas116_ cas120_1 cas120_2
## FALSE FALSE TRUE FALSE TRUE TRUE
## cas120_3 cas120_4 cas121_1 cas121_2 cas121_3 cas121_4
## TRUE TRUE FALSE FALSE FALSE FALSE
## cas125_ cas130_1 cas130_2 cas130_3 cas130_4 cas130_5
## TRUE TRUE TRUE TRUE TRUE TRUE
## cas131_1 cas131_2 cas131_3 cas131_4 cas131_5 cas126_
## FALSE TRUE FALSE FALSE FALSE TRUE
## cas127_1 cas127_2 cas127_3 cas127_4 cas127_5 cait104_
## TRUE TRUE TRUE TRUE TRUE TRUE
## cait105_ cait106_3 cait106_4 cait106_5 cait106_6 caf001_
## TRUE FALSE FALSE FALSE FALSE FALSE
## caf002_ caf005_
## FALSE FALSE
Subsequently we can filter for the column index which have only unique 2 values and assigned it to a new list.
C <- B %>% grep(TRUE,.)
length(C)
## [1] 192
#Removing Gender Variable
BinaryVar <- C[-1]
BinaryVar
## [1] 12 18 19 20 21 22 23 24 26 27 28 29 30 31 32 33 34 35
## [19] 36 37 38 47 49 50 52 53 54 56 57 59 60 62 65 66 67 68
## [37] 69 70 71 72 73 74 82 83 84 85 86 87 88 89 90 91 99 100
## [55] 101 102 103 104 105 106 107 108 109 111 112 113 114 115 116 117 118 119
## [73] 120 121 122 129 130 131 132 133 134 135 136 137 138 139 143 145 146 147
## [91] 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
## [109] 166 167 168 170 171 172 173 174 175 176 177 179 180 181 182 183 184 190
## [127] 192 194 195 196 197 200 203 205 206 209 211 212 213 214 215 216 217 218
## [145] 219 220 221 222 223 226 228 231 233 234 235 236 237 238 239 240 241 242
## [163] 245 257 258 259 260 265 266 267 268 273 275 276 277 278 283 284 285 286
## [181] 287 288 290 294 295 296 297 298 299 300 301
Similar to the approach earlier for missing code, we will 1) initiate an empty tibble, 2) loop through and unite unique values in the binary columns. In order to identify whether it is approapriate to transform the data
#Initiating an empty tibble
tbl_colnames = c('variable','value')
BinaryValues=as_tibble(matrix(nrow = 0, ncol = length(tbl_colnames)), .name_repair = ~ tbl_colnames)
BinaryValues=BinaryValues %>%
mutate(across(everything(), as.character))
BinaryValues %>% summarise_all(class)
## # A tibble: 1 x 2
## variable value
## <chr> <chr>
## 1 character character
#Loop Though the Binary Variables list with only 2 values
for (val in BinaryVar)
{
# Unite all the unique results into a new tibble "BinaryValues"
B <- unique(data[, val])
C <- pivot_longer(B, cols =colnames(B), names_to = "variable", values_to = "value")
BinaryValues <- union(BinaryValues,C)
}
BinaryValues <- BinaryValues%>% filter(!is.na(BinaryValues$value))
head(BinaryValues)
## # A tibble: 6 x 2
## variable value
## <chr> <chr>
## 1 caho100_ 1. Yes
## 2 caho100_ 5. No
## 3 cah004_1 5. No
## 4 cah004_1 1. Yes
## 5 cah004_2 5. No
## 6 cah004_2 1. Yes
We can now export the resulting tibble to csv to explore in excel whether the columns should be transformed
# write.csv(BinaryValues, file = "BinaryValues.csv", row.names =FALSE)
We are also able to aggregate and count all the occurence of elements in the binary columns
BinElements <- BinaryValues %>%
count(value, sort = TRUE)
BinElements
## # A tibble: 14 x 2
## value n
## <chr> <int>
## 1 1. Yes 90
## 2 5. No 90
## 3 0. Not selected 83
## 4 1. Selected 83
## 5 2020 9
## 6 2021 9
## 7 1 4
## 8 2. About the same 4
## 9 1. Less so 3
## 10 2 3
## 11 1. Trouble with sleep or recent change in pattern 1
## 12 2. No trouble sleeping 1
## 13 3. More often 1
## 14 30 1
By exploring the csv exported and the count of elements above, we decided to only transformed the binary variables column containing the elements (“1. Yes”, “5. No”, “0. Not selected”,“1. Selected”)
#Excluding Element that should not be transformed
ExcludeBin <- BinElements%>% filter(BinElements$n<10)
ExcludeBin$value
## [1] "2020"
## [2] "2021"
## [3] "1"
## [4] "2. About the same"
## [5] "1. Less so"
## [6] "2"
## [7] "1. Trouble with sleep or recent change in pattern"
## [8] "2. No trouble sleeping"
## [9] "3. More often"
## [10] "30"
#Subsequently Identifying Binary Column Not Eligible or Advisable for Data Transformation
ExcludeBinCols <- BinaryValues %>% filter(value%in%ExcludeBin$value) %>% select(variable) %>% {unique(.$variable)}
ExcludeBinCols
## [1] "cah121_2" "camh113_2" "camh007_" "camh118_2" "cac111_3b" "cac114_3b"
## [7] "cac114_4b" "cac114_8b" "caep101_1" "caw123_2" "caw123_4" "caw126_2"
## [13] "caw126_4" "cae106_2" "cae106_4" "cae108_2" "cae108_4" "cas131_2"
From the BinaryValues tibble created earlier, we are able to assign all the variable into a new list
BinCols <- BinaryValues %>% {unique(.$variable)}
By Comparing the set different between all binary columns and the binary columns to be excluded, we can now create a third list which are only binary column that is eligible for transformation.
EligBinCols <- setdiff(BinCols,ExcludeBinCols) # Compare Sets of Variable and Remove Not Elligible Columns
#Compare Lengths of List for Validation
length(BinCols) # Total Binary Variables
## [1] 191
length(ExcludeBinCols) # Binary Variables to be excluded
## [1] 18
length(EligBinCols) # Remaining Binary Variables For Transformation
## [1] 173
After we have obtain the final list of Columns to be transformed, we can then quickly mutate and recode all with the code below
data <- data %>%
mutate_at(vars(colnames(data[EligBinCols])),
~as.numeric(recode(.,
"1. Yes"=1,
"5. No"=0,
"1. Selected"=1,
"0. Not selected"=0)))
# # Backup
# data <- as_tibble(read_csv("Share8MissingValuesBinaryCleaned.csv"))
Prior to in depth analysis, we have also identify appropriate columns to be renamed, in order to smoothen the process ahead.
#With the function created earlier, we are able to quickly lookup column definition on the fly.
#Below are the dictionary definition for the columns we used in our analysis
for (i in c('country','cadn042_','cadn003_','caho037_','cah102_','caph003_','cac105_1','cac120_1','cac120_2','cac120_3','cac120_4','cac120_5','cac120_6','cac120_7','cac120_8','cac122_','cac111_1'))
{
print(paste(Vardetails(i,2)," - ", i," - ", Vardetails(i)))
}
## [1] "Intro - country - Country identifier"
## [1] "Intro - cadn042_ - Intro: sex"
## [1] "Intro - cadn003_ - Intro: year of birth"
## [1] "Intro - caho037_ - Intro: area lived in"
## [1] "Health - cah102_ - Health: change in health in the last 3 months"
## [1] "Health - caph003_ - Health: rating of subjective health"
## [1] "COVID-19 - cac105_1 - COVID-19: respondent tested positive"
## [1] "COVID-19 - cac120_1 - COVID-19: fatigue attributed to respondent's Covid illness"
## [1] "COVID-19 - cac120_2 - COVID-19: cough, congestion, shortness of breath attributed to respondent's Covi"
## [1] "COVID-19 - cac120_3 - COVID-19: loss of taste or smell attributed to respondent's Covid illness"
## [1] "COVID-19 - cac120_4 - COVID-19: headache attributed to respondent's Covid illness"
## [1] "COVID-19 - cac120_5 - COVID-19: body aches, joint pain attributed to respondent's Covid illness"
## [1] "COVID-19 - cac120_6 - COVID-19: chest or abdominal pain attributed to respondent's Covid illness"
## [1] "COVID-19 - cac120_7 - COVID-19: diarrhoea, nausea attributed to respondent's Covid illness"
## [1] "COVID-19 - cac120_8 - COVID-19: confusion attributed to respondent's Covid illness"
## [1] "COVID-19 - cac122_ - COVID-19: drugs taken by respondent to alleviate these [long-term or lingering]"
## [1] "COVID-19 - cac111_1 - COVID-19: respondent hospitalized"
Based on the definitions above, we can also first rename some keep variables to be used in our analysis
data <- data %>%
rename(.,
country=country,
gender=cadn042_,
yearborn=cadn003_,
living_area=caho037_,
health_change=cah102_,
health_rate=caph003_,
test_positive=cac105_1,
sequelae1=cac120_1,
sequelae2=cac120_2,
sequelae3=cac120_3,
sequelae4=cac120_4,
sequelae5=cac120_5,
sequelae6=cac120_6,
sequelae7=cac120_7,
sequelae8=cac120_8,
drugs_taken = cac122_,
hospitalized = cac111_1,)
Additionally, we are also able to estimate the respondents age, by substracting “2020”, (which was the main period for the share8 survey) ~release survey page 10 against the respondents “cadn003_” year of birth data. The result can then be grouped into different age bins.
#Estimate Respondents Age
data$age <- 2020-data$yearborn
#Grouped into Age Bins
data <- data %>%
mutate(agegroup = cut(age, seq(0, max(age) + 6, 10), right = FALSE))
head(data[,c('yearborn','age','agegroup')])
## # A tibble: 6 x 3
## yearborn age agegroup
## <dbl> <dbl> <fct>
## 1 1952 68 [60,70)
## 2 1951 69 [60,70)
## 3 1924 96 [90,100)
## 4 1942 78 [70,80)
## 5 1942 78 [70,80)
## 6 1951 69 [60,70)
#Quick Recoding of Gender Variable
data <- data %>%
mutate(gender=recode(gender,
"1. Male"='Male',
"2. Female"='Female'))
#BACKUP
# write.csv(data, file = "Share8MissingValuesBinaryRenamedCleaned.csv", row.names =FALSE)
#EDA Starts Here
As a whole, the share8 dataset provided contains 318 columns (+2 derived age variable) with 49253 observation
# data
attach(data)
dim(data)
## [1] 49253 310
In this section, we will show all the variables used in this report. Including: country, gender, age, agegroup, living_area, health_change, health_rate, test_positive, eight major sequelaes. And these variables will be used in the EDA and Modelling sections respectively.
#Create the data frame for this report
df<-data.frame(country,gender,age,agegroup,living_area,health_change,health_rate,test_positive,sequelae1,sequelae2,sequelae3,sequelae4,sequelae5,sequelae6,sequelae7,sequelae8,drugs_taken,hospitalized)
#Check size of data frame
dim(df)
## [1] 49253 18
#Check variable names
head(df)
## country gender age agegroup living_area health_change health_rate
## 1 Austria Female 68 [60,70) <NA> 2. About the same 2. Very good
## 2 Austria Male 69 [60,70) <NA> 2. About the same 2. Very good
## 3 Austria Male 96 [90,100) <NA> 2. About the same 3. Good
## 4 Austria Female 78 [70,80) <NA> 3. Worsened 4. Fair
## 5 Austria Male 78 [70,80) <NA> 1. Improved 4. Fair
## 6 Austria Female 69 [60,70) <NA> 2. About the same 3. Good
## test_positive sequelae1 sequelae2 sequelae3 sequelae4 sequelae5 sequelae6
## 1 NA NA NA NA NA NA NA
## 2 0 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## sequelae7 sequelae8 drugs_taken hospitalized
## 1 NA NA NA NA
## 2 NA NA NA 0
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA 0
Some general demographic variable are available and derived from the dataset
Firstly we can understand the background of our share8 respondents as a whole
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
#Count Respondents Occurence by Country and rename country to region to join against map data
countrydf <- data %>% group_by(country) %>% tally() %>% rename(., region = country)
# countrydf
# 1) Load Map Dataset from Library 2) Left Join Against Share8 Data 3) Filter out unrelated countries
mapdata1 <- map_data("world") %>%
left_join(., countrydf, by="region") %>%
filter(!is.na(.$n))
map2 <- ggplot(mapdata1, aes( x = long, y = lat, group=group)) +
geom_polygon(color = "white",aes(fill = n, text = region)) +
scale_fill_gradient(name = "Participants",
high = munsell::mnsl("5P 2/12"),
low = munsell::mnsl("5P 7/12")
, na.value = "grey50"
)+
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.y=element_blank(),
axis.title.x=element_blank(),
rect = element_rect(fill = "transparent")
)+
theme(
panel.background = element_rect(fill = "transparent",
colour = NA_character_),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "transparent",
colour = NA_character_),
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent"),
legend.title=element_text(color="white")
)
## Warning: Ignoring unknown aesthetics: text
Map3 <- ggplotly(map2,tooltip = c("fill","text"))
Map3
# options(browser = 'false')
# api_create(Map3, filename = "map1")
#https://chart-studio.plotly.com/~plotlys2022355/14
Share8 respondents resides in 28 european countries.
Alternatively, we can also understand the respondent demographic in terms of country and gender. The population pyramid below observed that there are in general more female than male respondents
countrygp <- data %>% group_by(country, gender) %>% tally()
countrygp$norg <-countrygp$n
#Male Values are ploted as negative, with labels showing absolute values to produce population pyramid chart)
countrygp$n[countrygp$gender == "Male"]<-countrygp$n[countrygp$gender == "Male"]*-1
B = ggplot(countrygp,aes(x = reorder(country, norg),text=(country) ,y = n, fill=gender)) +
geom_bar(stat = "identity") +
coord_flip()+
labs(title = "Country of Survey Respondents", x = "",
y = "")+
scale_y_continuous(labels = abs)+
theme(
panel.background = element_rect(fill = "transparent", colour = NA_character_),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "transparent", colour = NA_character_),
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent")
# ,legend.title=element_text(color="white")
# ,legend.text=element_text(color="white")
# ,plot.title=element_text(color="white")
# ,axis.text.y=element_text(color="white")
# ,axis.text.x=element_text(color="white")
)
Bplotly <- ggplotly(B,tooltip = c("y","text"))
Bplotly
# options(browser = 'false')
# api_create(Bplotly, filename = "CountryDistribution")
#https://chart-studio.plotly.com/~plotlys2022355/20
From the pie chart below, female generally consist of 60% of respondents for all countries
genderdf = data[,c('country','gender','agegroup')]
#Created and Union additional Tibble to create additional country category 'All'
allgenderdf = genderdf
allgenderdf$country = 'All'
genderdf = union_all(genderdf,allgenderdf)
genderdf <- genderdf %>% group_by(country, gender) %>% tally()
# head(genderdf)
plotpie <- plot_ly(data=genderdf,labels=~factor(gender), values=~n, frame = ~country, marker = list(colors = c('orchid', 'skyblue')), type="pie"
) %>%
layout(title = 'Breakdown of Participants Gender',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")#%>% layout(font = list(color = 'white'))
plotpie
# options(browser = 'false')
# api_create(plotpie, filename = "GenderPieChart")
#https://chart-studio.plotly.com/~plotlys2022355/38
As we have already derived earlier, We can also explore our data in terms of age.
#Population Pyramid of Respondents Age
#Age Related Tibbles for Chart Constuction
Agedf <- data[,c('country','gender','age')]
agegp <- Agedf %>% group_by(age, gender) %>% tally()
agegp$norg <-agegp$n
#Male Plotted on negative x-axis to achieve the population pyramid view
agegp$n[agegp$gender == "Male"]<-agegp$n[agegp$gender == "Male"]*-1
C = ggplot(agegp,aes(x = age ,y = n, fill=gender)) +
geom_bar(stat = "identity") +
coord_flip()+
labs(title = "Distribution of Respondent's Age", x = "age",
y = "")+
scale_y_continuous(labels = abs)+
# Transparent Aesthetics Themes
theme(
panel.background = element_rect(fill = "transparent", colour = NA_character_),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "transparent", colour = NA_character_),
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent")
)
#Wrapping GGPlot Charts to Provide interactivity
Cplotly<-ggplotly(C,tooltip = c("y","x"))
Cplotly
# options(browser = 'false')
# api_create(Cplotly, filename = "AgeDistribution") #PlotlyStudioUpload
#https://chart-studio.plotly.com/~plotlys2022355/18
We can check age distribution with a box plot
agegender = data[,c('country','gender','age')]
#Created and Union additional Tibble to create additional gender category 'All'
ageall = data[,c('country','age')]
ageall$gender = 'Overall'
agegender = union(agegender,ageall)
#Created and Union additional Tibble to create additional country category 'All'
allcountryagegender = agegender
allcountryagegender$country = 'All'
agegender = union_all(agegender,allcountryagegender)
# Plot
g <- ggplot(agegender, aes(gender, age,fill=gender, frame=country))+
geom_boxplot() +
labs(title="Box plot of Age by Gender and Country",
subtitle="boxplot by Gender",
caption="Source: boxplot",
x="gender",
y="age")+
scale_fill_manual(values=c("pink",
"cadetblue1",
"cornsilk"))+
# Transparent Aesthetics Themes
theme(
panel.background = element_rect(fill = "transparent",
colour = NA_character_),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "transparent",
colour = NA_character_),
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent")
)
Dplotly <- ggplotly(g,tooltip = c("y","text"))
Dplotly
# options(browser = 'false')
# api_create(Dplotly, filename = "AgeGenderCountryBoxPlot")
#Output Link: https://chart-studio.plotly.com/~plotlys2022355/36
Exploration of Main Variable of our study
#Convert the character and numeric variables in the data frame to factor for further analysis
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
df[sapply(df, is.numeric)] <- lapply(df[sapply(df, is.numeric)], as.factor)
#Browse the basic information of the data frame
summary(df)
## country gender age agegroup
## Estonia : 4069 Female:28684 70 : 2073 [60,70) :18434
## Belgium : 3451 Male :20569 66 : 2031 [70,80) :16541
## Greece : 3399 65 : 2012 [80,90) : 7371
## Italy : 3360 72 : 1999 [50,60) : 5721
## Slovenia: 2946 68 : 1980 [90,100): 934
## Poland : 2798 67 : 1964 [40,50) : 220
## (Other) :29230 (Other):37194 (Other) : 32
## living_area health_change
## 1. A big city : 619 1. Improved : 2917
## 2. The suburbs or outskirts of a big city: 375 2. About the same:39182
## 3. A large town : 564 3. Worsened : 7104
## 4. A small town : 791 NA's : 50
## 5. A rural area or village : 958
## NA's :45946
##
## health_rate test_positive sequelae1 sequelae2 sequelae3
## 1. Excellent: 2143 0 :15099 0 : 2016 0 : 2593 0 : 2805
## 2. Very good: 7648 1 : 3167 1 : 2056 1 : 1479 1 : 1267
## 3. Good :19706 NA's:30987 NA's:45181 NA's:45181 NA's:45181
## 4. Fair :14879
## 5. Poor : 4842
## NA's : 35
##
## sequelae4 sequelae5 sequelae6 sequelae7 sequelae8 drugs_taken
## 0 : 2910 0 : 2728 0 : 3478 0 : 3668 0 : 3733 0 : 1500
## 1 : 1162 1 : 1344 1 : 594 1 : 404 1 : 339 1 : 1444
## NA's:45181 NA's:45181 NA's:45181 NA's:45181 NA's:45181 NA's:46309
##
##
##
##
## hospitalized
## 0 : 5435
## 1 : 594
## NA's:43224
##
##
##
##
1.1 Changes in the health and health status of respondents living in different area
#Import the source and packages required by analysis
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
library(ggplot2)
library(dplyr)
options(dplyr.summarise.inform = FALSE)
#Exhibit the change of health in different living area by crosstab
living_change<-data.frame(df$living_area,df$health_change)
#remove the missing value
living1<-drop_na(rename(living_change, living_area1 =df.living_area,health_change1 = df.health_change))
#build crosstab with count and row precentage
crosstab(living1, row.vars = "living_area1", col.vars = "health_change1",type = c("f", "r"), style = "long",addmargins = FALSE)
## health_change1 1. Improved 2. About the same 3. Worsened
## living_area1
## 1. A big city Count 33.00 482.00 104.00
## Row % 5.33 77.87 16.80
## 2. The suburbs or outskirts of a big city Count 27.00 295.00 52.00
## Row % 7.22 78.88 13.90
## 3. A large town Count 41.00 447.00 76.00
## Row % 7.27 79.26 13.48
## 4. A small town Count 74.00 600.00 116.00
## Row % 9.37 75.95 14.68
## 5. A rural area or village Count 77.00 749.00 132.00
## Row % 8.04 78.18 13.78
#Exhibit the health status in different living area by crosstab
living_rate<-data.frame(df$living_area,df$health_rate)
#remove the missing value
living2<-drop_na(rename(living_rate, living_area1 =df.living_area,health_rate1 = df.health_rate))
#build crosstab with count and row precentage
crosstab(living2, row.vars = "living_area1", col.vars = "health_rate1",type = c("f", "r"), style = "long",addmargins = FALSE)
## health_rate1 1. Excellent 2. Very good 3. Good 4. Fair 5. Poor
## living_area1
## 1. A big city Count 30.00 90.00 204.00 234.00 61.00
## Row % 4.85 14.54 32.96 37.80 9.85
## 2. The suburbs or outskirts of a big city Count 30.00 77.00 147.00 94.00 27.00
## Row % 8.00 20.53 39.20 25.07 7.20
## 3. A large town Count 45.00 85.00 201.00 174.00 59.00
## Row % 7.98 15.07 35.64 30.85 10.46
## 4. A small town Count 48.00 122.00 293.00 250.00 78.00
## Row % 6.07 15.42 37.04 31.61 9.86
## 5. A rural area or village Count 60.00 158.00 341.00 295.00 103.00
## Row % 6.27 16.51 35.63 30.83 10.76
living_areadf = data[,c('country','living_area','health_change','health_rate')]%>% na.omit(.)
living_areadfhealthrate <- living_areadf %>% group_by(health_rate, living_area) %>% tally()
living_areadfhealthchange <- living_areadf %>% group_by(health_change, living_area) %>% tally()
living_areadfhealthratefig1 <- plot_ly(data=living_areadfhealthrate,labels=~factor(health_rate), values=~n, frame = ~living_area, marker = list(colors = c('orchid', 'skyblue')), type="pie"
) %>%
layout(title = 'Original Health Rating',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")#%>% layout(font = list(color = 'white'))
living_areadfhealthchangefig2 <- plot_ly(data=living_areadfhealthchange,labels=~factor(health_change), values=~n, frame = ~living_area, marker = list(colors = c('orchid', 'skyblue')), type="pie"
) %>%
layout(title = 'Health Change in 3 Months',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")#%>% layout(font = list(color = 'white'))
living_areadfhealthratefig1
living_areadfhealthchangefig2
# options(browser = 'false')
# api_create(living_areadfhealthratefig1, filename = "living_areadfhealthratefig1")
# https://chart-studio.plotly.com/~plotlys2022355/52/#/
# options(browser = 'false')
# api_create(living_areadfhealthchangefig2, filename = "living_areadfhealthchangefig2")
# https://chart-studio.plotly.com/~plotlys2022355/54/#/
According to crosstab and Figures, regardless of which area the respondents lived in, those who considered themselves health status is good or fair accounted for two-thirds of the overall living area. Moreover, 15% to 20% of people think they are in very good health, and only a few people think their health status are excellent or poor. Similarly, the number of respondents living in big cities who thought their health was excellent/very good/good was significantly lower than in other regions. On the other hand, respondents living in the suburbs or outskirts of a big city were more likely to feel healthy. And respondents in the rest of the regions were similar in health.
1.2 Changes in health and health status of respondents by Gender
#Exhibit the change of health by gender via crosstab
gender_change<-data.frame(df$gender,df$health_change)
#remove the missing value
gender1<-drop_na(rename(gender_change, gender1 =df.gender,health_change1 = df.health_change))
#build crosstab with count and row precentage
crosstab(gender1, row.vars = "gender1", col.vars = "health_change1",type = c("f", "r"), style = "long",addmargins = FALSE)
## health_change1 1. Improved 2. About the same 3. Worsened
## gender1
## Female Count 1796.00 22419.00 4431.00
## Row % 6.27 78.26 15.47
## Male Count 1121.00 16763.00 2673.00
## Row % 5.45 81.54 13.00
#Exhibit the change of health by gender via crosstab
gender_rate<-data.frame(df$gender,df$health_rate)
#remove the missing value
gender2<-drop_na(rename(gender_rate, gender1 =df.gender,health_rate1 = df.health_rate))
#build crosstab with count and row precentage
crosstab(gender2, row.vars = "gender1", col.vars = "health_rate1",type = c("f", "r"), style = "long",addmargins = FALSE)
## health_rate1 1. Excellent 2. Very good 3. Good 4. Fair 5. Poor
## gender1
## Female Count 1167.00 4383.00 11274.00 8890.00 2946.00
## Row % 4.07 15.29 39.34 31.02 10.28
## Male Count 976.00 3265.00 8432.00 5989.00 1896.00
## Row % 4.75 15.88 41.02 29.13 9.22
genderdf = data[,c('country','gender','health_change','health_rate')]%>% na.omit(.)
genderdfhealthrate <- genderdf %>% group_by(health_rate, gender) %>% tally()
genderdfhealthchange <- genderdf %>% group_by(health_change, gender) %>% tally()
fig <- plot_ly()
fig <- fig %>% add_pie(data = genderdfhealthrate %>% filter(gender == 'Male'), labels = ~factor(health_rate), values = ~n,
name = "Male: Start", textinfo='label+percent', domain = list(row = 0, column = 0))
fig <- fig %>% add_pie(data = genderdfhealthrate %>% filter(gender == 'Female'), labels = ~factor(health_rate), values = ~n,
name = "Female: Start", textinfo='label+percent', domain = list(row = 0, column = 1))
fig <- fig %>% add_pie(data = genderdfhealthchange %>% filter(gender == 'Male'), labels = ~factor(health_change), values = ~n,
name = "Male: 3 Months", textinfo='label+percent', domain = list(row = 1, column = 0),hole = 0.6)
fig <- fig %>% add_pie(data = genderdfhealthchange %>% filter(gender == 'Female'), labels = ~factor(health_change), values = ~n,
name = "Female: 3 Months", textinfo='label+percent', domain = list(row = 1, column = 1),hole = 0.6)
fig <- fig %>% layout(title = "Health Sentiments By Gender", showlegend = F,
grid=list(rows=2, columns=2),
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig%>%
layout(
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
# options(browser = 'false')
# api_create(fig, filename = "genderpiesubplots")
# https://chart-studio.plotly.com/~plotlys2022355/56
According to crosstab and Figures, regardless of male or female, the vast majority of respondents felt that their physical health had not changed, followed by worse and better. And female are more likely to consider their health is better or worse than male, possibly because women are more concerned about their health.
According to crosstab the Figures, overall, there were no significant differences in health between male and female. However, women are more likely to feel that their health status is fair and poor relatively.
1.3 Changes in health and health status of respondents by country
#Exhibit the change of health in different country by crosstab
country_change<-data.frame(df$country,df$health_change)
#remove the missing value
country1<-drop_na(rename(country_change, country1 =df.country,health_change1 = df.health_change))
#build crosstab with count and row precentage
crosstab(country1, row.vars = "country1", col.vars = "health_change1",type = c("f", "r"), style = "wide",addmargins = FALSE)
## Count Row %
## health_change1 1. Improved 2. About the same 3. Worsened 1. Improved 2. About the same 3. Worsened
## country1
## Austria 147.00 1839.00 326.00 6.36 79.54 14.10
## Belgium 302.00 2690.00 454.00 8.76 78.06 13.17
## Bulgaria 11.00 520.00 174.00 1.56 73.76 24.68
## Croatia 84.00 1495.00 331.00 4.40 78.27 17.33
## Cyprus 16.00 537.00 100.00 2.45 82.24 15.31
## Czech Republic 146.00 1636.00 307.00 6.99 78.31 14.70
## Denmark 134.00 1380.00 76.00 8.43 86.79 4.78
## Estonia 287.00 3155.00 623.00 7.06 77.61 15.33
## Finland 165.00 1033.00 113.00 12.59 78.79 8.62
## France 140.00 1491.00 224.00 7.55 80.38 12.08
## Germany 196.00 1568.00 272.00 9.63 77.01 13.36
## Greece 64.00 2987.00 342.00 1.89 88.03 10.08
## Hungary 27.00 737.00 98.00 3.13 85.50 11.37
## Israel 41.00 913.00 328.00 3.20 71.22 25.59
## Italy 81.00 2812.00 466.00 2.41 83.72 13.87
## Latvia 27.00 753.00 195.00 2.77 77.23 20.00
## Lithuania 80.00 920.00 258.00 6.36 73.13 20.51
## Luxembourg 63.00 698.00 104.00 7.28 80.69 12.02
## Malta 23.00 624.00 143.00 2.91 78.99 18.10
## Netherlands 85.00 578.00 67.00 11.64 79.18 9.18
## Poland 137.00 2101.00 557.00 4.90 75.17 19.93
## Portugal 74.00 803.00 194.00 6.91 74.98 18.11
## Romania 67.00 1157.00 243.00 4.57 78.87 16.56
## Slovakia 40.00 748.00 137.00 4.32 80.86 14.81
## Slovenia 194.00 2361.00 391.00 6.59 80.14 13.27
## Spain 73.00 1417.00 305.00 4.07 78.94 16.99
## Sweden 84.00 802.00 82.00 8.68 82.85 8.47
## Switzerland 129.00 1427.00 194.00 7.37 81.54 11.09
#Exhibit the health status in different country by crosstab
country_rate<-data.frame(df$country,df$health_rate)
#remove the missing value
country2<-drop_na(rename(country_rate, country1 =df.country,health_rate1 = df.health_rate))
#build crosstab with count and row precentage
crosstab(country2, row.vars = "country1", col.vars = "health_rate1",type = c("f", "r"), style = "wide",addmargins = FALSE)
## Count Row %
## health_rate1 1. Excellent 2. Very good 3. Good 4. Fair 5. Poor 1. Excellent 2. Very good 3. Good 4. Fair 5. Poor
## country1
## Austria 146.00 552.00 884.00 548.00 184.00 6.31 23.85 38.20 23.68 7.95
## Belgium 218.00 811.00 1491.00 762.00 168.00 6.32 23.51 43.22 22.09 4.87
## Bulgaria 14.00 65.00 299.00 250.00 78.00 1.98 9.21 42.35 35.41 11.05
## Croatia 102.00 394.00 706.00 480.00 228.00 5.34 20.63 36.96 25.13 11.94
## Cyprus 26.00 123.00 277.00 183.00 44.00 3.98 18.84 42.42 28.02 6.74
## Czech Republic 61.00 248.00 1051.00 566.00 162.00 2.92 11.88 50.34 27.11 7.76
## Denmark 237.00 621.00 383.00 274.00 74.00 14.92 39.08 24.10 17.24 4.66
## Estonia 44.00 195.00 1102.00 2028.00 699.00 1.08 4.79 27.09 49.85 17.18
## Finland 81.00 215.00 589.00 352.00 74.00 6.18 16.40 44.93 26.85 5.64
## France 86.00 286.00 833.00 476.00 173.00 4.64 15.43 44.93 25.67 9.33
## Germany 82.00 320.00 839.00 629.00 166.00 4.03 15.72 41.21 30.89 8.15
## Greece 130.00 745.00 1353.00 871.00 300.00 3.82 21.92 39.81 25.63 8.83
## Hungary 24.00 118.00 349.00 283.00 88.00 2.78 13.69 40.49 32.83 10.21
## Israel 59.00 225.00 386.00 433.00 176.00 4.61 17.59 30.18 33.85 13.76
## Italy 86.00 361.00 1412.00 1190.00 310.00 2.56 10.75 42.04 35.43 9.23
## Latvia 1.00 13.00 292.00 507.00 162.00 0.10 1.33 29.95 52.00 16.62
## Lithuania 25.00 45.00 467.00 595.00 125.00 1.99 3.58 37.15 47.33 9.94
## Luxembourg 37.00 179.00 405.00 179.00 67.00 4.27 20.65 46.71 20.65 7.73
## Malta 18.00 59.00 346.00 328.00 39.00 2.28 7.47 43.80 41.52 4.94
## Netherlands 99.00 148.00 299.00 154.00 30.00 13.56 20.27 40.96 21.10 4.11
## Poland 18.00 201.00 1324.00 927.00 327.00 0.64 7.19 47.34 33.14 11.69
## Portugal 24.00 51.00 285.00 462.00 249.00 2.24 4.76 26.61 43.14 23.25
## Romania 17.00 170.00 688.00 349.00 243.00 1.16 11.59 46.90 23.79 16.56
## Slovakia 83.00 149.00 491.00 172.00 30.00 8.97 16.11 53.08 18.59 3.24
## Slovenia 116.00 452.00 1339.00 740.00 299.00 3.94 15.34 45.45 25.12 10.15
## Spain 30.00 175.00 706.00 648.00 237.00 1.67 9.74 39.31 36.08 13.20
## Sweden 130.00 219.00 348.00 226.00 45.00 13.43 22.62 35.95 23.35 4.65
## Switzerland 149.00 508.00 762.00 267.00 65.00 8.51 29.01 43.52 15.25 3.71
The health outcomes for each countries
plottb <- data[,c('country','gender','agegroup','health_rate','health_change','test_positive')] %>%
group_by(health_rate, country, health_change)%>% tally()%>%
rename(., total='n')%>%na.omit(.)
#Percentage Computation
plottbgroup <- plottb %>% group_by(health_rate, country) %>%
summarise(Frequency = sum(total))
plottb <- left_join(plottb, plottbgroup, by=c("health_rate","country"))
plottb$percentage <- plottb$total / plottb$Frequency *100
plottb$percentage <- round(plottb$percentage, digits = 2)
plottb <- plottb %>% filter(health_change != '2. About the same')
# plot
E<- plottb %>%
ggplot(mapping = aes(y = reorder(health_rate, desc(health_rate)), x = health_change, text=paste(percentage,"%"), frame=country)) +
geom_tile(mapping = aes(fill = percentage, frame=country)) +
scale_fill_distiller(palette = "PuRd", direction = 1) +
theme_light() +
labs(title="% Health progression after 3 Months By Country",
y="Health reported at start",
x="Health condition after 3 months",
caption ='test',
color=NULL)+
theme(
panel.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing panel outline
panel.grid.major = element_blank(), # get rid of major grid
panel.grid.minor = element_blank(), # get rid of minor grid
plot.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing plot outline
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent")
) + geom_text(aes(label = paste(percentage,"%")))
## Warning: Ignoring unknown aesthetics: frame
Eplotly<-ggplotly(E,tooltip = c("x", "text"))
Eplotly
# options(browser = 'false')
# api_create(Eplotly, filename = "HealthChangeCountry")
#https://chart-studio.plotly.com/~plotlys2022355/44
1.4 Changes in health and health status of respondents by agegroup
#Exhibit the change of health in different agegroup by crosstab
agegroup_change<-data.frame(df$agegroup,df$health_change)
#remove the missing value
agegroup1<-drop_na(rename(agegroup_change, agegroup1 =df.agegroup,health_change1 = df.health_change))
#build crosstab with count and row precentage
crosstab(agegroup1, row.vars = "agegroup1", col.vars = "health_change1",type = c("f", "r"), style = "wide",addmargins = FALSE)
## Count Row %
## health_change1 1. Improved 2. About the same 3. Worsened 1. Improved 2. About the same 3. Worsened
## agegroup1
## [0,10) 0.00 0.00 0.00 0.00 0.00 0.00
## [10,20) 0.00 0.00 0.00 0.00 0.00 0.00
## [20,30) 0.00 1.00 0.00 0.00 100.00 0.00
## [30,40) 0.00 19.00 0.00 0.00 100.00 0.00
## [40,50) 14.00 192.00 14.00 6.36 87.27 6.36
## [50,60) 372.00 4762.00 583.00 6.51 83.30 10.20
## [60,70) 1169.00 15232.00 2022.00 6.35 82.68 10.98
## [70,80) 982.00 13031.00 2510.00 5.94 78.87 15.19
## [80,90) 352.00 5330.00 1675.00 4.78 72.45 22.77
## [90,100) 28.00 608.00 295.00 3.01 65.31 31.69
## [100,110) 0.00 7.00 5.00 0.00 58.33 41.67
#Exhibit the health status in different agegroup by crosstab
agegroup_rate<-data.frame(df$agegroup,df$health_rate)
#remove the missing value
agegroup2<-drop_na(rename(agegroup_rate, agegroup1 =df.agegroup,health_rate1 = df.health_rate))
#build crosstab with count and row precentage
crosstab(agegroup2, row.vars = "agegroup1", col.vars = "health_rate1",type = c("f", "r"), style = "wide",addmargins = FALSE)
## Count Row %
## health_rate1 1. Excellent 2. Very good 3. Good 4. Fair 5. Poor 1. Excellent 2. Very good 3. Good 4. Fair 5. Poor
## agegroup1
## [0,10) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## [10,20) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## [20,30) 1.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
## [30,40) 8.00 9.00 2.00 0.00 0.00 42.11 47.37 10.53 0.00 0.00
## [40,50) 25.00 69.00 83.00 32.00 11.00 11.36 31.36 37.73 14.55 5.00
## [50,60) 409.00 1350.00 2489.00 1173.00 296.00 7.15 23.61 43.54 20.52 5.18
## [60,70) 1042.00 3532.00 8035.00 4701.00 1115.00 5.66 19.17 43.61 25.51 6.05
## [70,80) 523.00 2134.00 6607.00 5507.00 1758.00 3.16 12.91 39.97 33.32 10.64
## [80,90) 125.00 500.00 2264.00 3067.00 1407.00 1.70 6.79 30.75 41.65 19.11
## [90,100) 10.00 53.00 221.00 396.00 252.00 1.07 5.69 23.71 42.49 27.04
## [100,110) 0.00 1.00 5.00 3.00 3.00 0.00 8.33 41.67 25.00 25.00
We can explore health outcomes by age groups
plottb <- data[,c('country','gender','age','agegroup','health_rate','health_change','test_positive')]%>%
group_by(health_rate, agegroup, health_change)%>% tally()%>%
rename(., total='n')%>%na.omit(.)%>%
mutate(agegroup=recode(agegroup,
"[40,50)"='1. 40-50',
"[50,60)"='2. 50-60',
"[60,70)"='3. 60-70',
"[70,80)"='4. 70-80',
"[80,90)"='5. 80-90',
"[90,100)"='6. 90-100',
"[100,110)"='7. 100-110' ))
#Percentage Computation
plottbgroup <- plottb %>% group_by(health_rate, agegroup) %>%
summarise(Frequency = sum(total))
plottb <- left_join(plottb, plottbgroup, by=c("health_rate","agegroup"))
plottb$percentage <- plottb$total / plottb$Frequency *100
plottb$percentage <- round(plottb$percentage, digits = 2)
plottb <- plottb %>% filter(health_change != '2. About the same')
# plot
G<- plottb %>%
ggplot(mapping = aes(y = reorder(health_rate, desc(health_rate)), x = health_change, text=paste(percentage,"%"), frame=agegroup)) +
geom_tile(mapping = aes(fill = percentage, frame=agegroup)) +
scale_fill_distiller(palette = "YlOrRd", direction = 1) +
theme_light() +
labs(title="% Health progression after 3 Months By Age Group",
y="Health reported at start",
x="Health condition after 3 months",
caption ='test',
color=NULL)+
theme(
panel.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing panel outline
panel.grid.major = element_blank(), # get rid of major grid
panel.grid.minor = element_blank(), # get rid of minor grid
plot.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing plot outline
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent")
)
## Warning: Ignoring unknown aesthetics: frame
# + geom_text(aes(label = paste(percentage,"%")))
Fplotly<-ggplotly(G,tooltip = c("x", "text"))
Fplotly
# options(browser = 'false')
# api_create(Eplotly, filename = "HealthChangeAgeGroup")
#https://chart-studio.plotly.com/~plotlys2022355/46
Next, this report translates the respondents’ physical health status into a point scale. The better the health, the higher the score (5 points for “1. Excellent” and 1 point for “5. Poor”). And we will also analyze the average health scores of respondents across different living area, gender, country and age group.
#Convert health_rate to numeric variable and recode the content
df$health_rate<-as.character(df$health_rate)
df$health_rate[df$health_rate == "1. Excellent"]<-5
df$health_rate[df$health_rate == "2. Very good"]<-4
df$health_rate[df$health_rate == "3. Good"]<-3
df$health_rate[df$health_rate == "4. Fair"]<-2
df$health_rate[df$health_rate == "5. Poor"]<-1
df$health_rate<-as.numeric(df$health_rate)
2.1 Average health rate of respondents living in different area
#Exhibit the average health rate in different living area
living_rate1<-data.frame(df$living_area,df$health_rate)
#remove the missing value
living3<-drop_na(rename(living_rate1, living_area1 =df.living_area,health_rate1 = df.health_rate))
#calculate the mean value, ranked as descending
arrange(living3 %>%
group_by(living_area1) %>%
summarise_at(vars(health_rate1), list(health_rate = mean)),desc(health_rate))
## # A tibble: 5 x 2
## living_area1 health_rate
## <fct> <dbl>
## 1 2. The suburbs or outskirts of a big city 2.97
## 2 3. A large town 2.79
## 3 5. A rural area or village 2.77
## 4 4. A small town 2.76
## 5 1. A big city 2.67
#Exhibit the average health rate in different living area by ggplot
living3_a<-living3 %>%
group_by(living_area1) %>%
summarise_at(vars(health_rate1), list(health_rate = mean))
#Build the barchart, ranked as descending
livingchart <- living3_a %>%
ggplot(aes(x=reorder(living_area1,-health_rate), y=health_rate))+
geom_bar(stat="identity", fill="wheat", alpha=.6, width=.9)+
scale_x_discrete(guide = guide_axis(n.dodge=3))+
labs(title="Figure 2.1 Average health rate of respondents by age group")+
xlab("living area") +
theme_bw()+
theme(axis.text.x = element_text(angle = 20, vjust = 0.5, hjust=1,(size=5)))
suppressWarnings(livingchart)
Respondents living in the suburbs or outskirts of a big city had the highest average health score (2.97), while those living in big cities had the lowest average health score (2.66). And the average score for the rest of the regions is around 2.75.
2.2 Average health rate of respondents by gender
#Exhibit the average health rate by gender
gender_rate1<-data.frame(df$gender,df$health_rate)
#remove the missing value
gender3<-drop_na(rename(gender_rate1, gender1 =df.gender,health_rate1 = df.health_rate))
#calculate the mean value
gender3 %>%
group_by(gender1) %>%
summarise_at(vars(health_rate1), list(health_rate = mean))
## # A tibble: 2 x 2
## gender1 health_rate
## <fct> <dbl>
## 1 Female 2.72
## 2 Male 2.78
#Exhibit the average health rate by gender via ggplot
gender3_a<-gender3 %>%
group_by(gender1) %>%
summarise_at(vars(health_rate1), list(health_rate = mean))
#build the bar chart
genderchart <- gender3_a %>%
ggplot(aes(x=gender1, y=health_rate))+
geom_bar(stat="identity", fill="deepskyblue4", alpha=.6, width=.9)+
labs(title="Figure 2.2 Average health rate of respondents by gender")+
xlab("gender") +
theme_bw()
genderchart
Similar to the results of “1.2 Changes in health and health status of respondents by Gender”, there was no significant difference in the average health scores of male and female. And the average score for men was only about 0.06 higher than that for women.
2.3 Average health rate of respondents in different countries
#Exhibit the average health rate in different country
country_rate1<-data.frame(df$country,df$health_rate)
#remove the missing value
country3<-drop_na(rename(country_rate1, country1 =df.country,health_rate1 = df.health_rate))
#calculate the mean value,show countries with the highest scores
head(arrange(country3 %>%
group_by(country1) %>%
summarise_at(vars(health_rate1), list(health_rate = mean)),desc(health_rate)))
## # A tibble: 6 x 2
## country1 health_rate
## <fct> <dbl>
## 1 Denmark 3.42
## 2 Switzerland 3.23
## 3 Netherlands 3.18
## 4 Sweden 3.17
## 5 Slovakia 3.09
## 6 Belgium 3.04
#calculate the mean value,show countries with the lowest scores
head(arrange(country3 %>%
group_by(country1) %>%
summarise_at(vars(health_rate1), list(health_rate = mean)),desc(-health_rate)))
## # A tibble: 6 x 2
## country1 health_rate
## <fct> <dbl>
## 1 Latvia 2.16
## 2 Portugal 2.20
## 3 Estonia 2.23
## 4 Lithuania 2.40
## 5 Spain 2.51
## 6 Poland 2.52
#Exhibit the average health rate in different country by ggplot
country3_a <- country3 %>%
group_by(country1) %>%
summarise_at(vars(health_rate1), list(health_rate = mean))
#Build the barchart, ranked as descending
countrychart <- country3_a %>%
ggplot(aes(x=reorder(country1,-health_rate), y=health_rate))+
geom_bar(stat="identity", fill="limegreen", alpha=.6, width=.9)+
scale_x_discrete(guide = guide_axis(n.dodge=3))+
labs(title="Figure 2.3 Average health rate of respondents in different countries")+
xlab("country") +
theme_bw()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1,(size=5)))
suppressWarnings(countrychart)
The six countries with the highest average health scores are: Denmark, Switzerland, Netherlands, Sweden, Slovakia, Belgium. The average score for the above countries is over 3 points. And the six countries with the lowest average health scores were: Latvia, Portugal, Estonia, Lithuania, Spain, Poland. Among them, the average score of Latvia, Portugal and Estonia is only about 2 points.
2.4 Average health rate of respondents by age group
#Exhibit the average health rate in different agegroup
agegroup_rate1<-data.frame(df$agegroup,df$health_rate)
#remove the missing value
agegroup3<-drop_na(rename(agegroup_rate1, agegroup1 =df.agegroup,health_rate1 = df.health_rate))
#calculate the mean value
arrange(agegroup3 %>%
group_by(agegroup1) %>%
summarise_at(vars(health_rate1), list(health_rate = mean)))
## # A tibble: 9 x 2
## agegroup1 health_rate
## <fct> <dbl>
## 1 [20,30) 5
## 2 [30,40) 4.32
## 3 [40,50) 3.30
## 4 [50,60) 3.07
## 5 [60,70) 2.93
## 6 [70,80) 2.65
## 7 [80,90) 2.30
## 8 [90,100) 2.11
## 9 [100,110) 2.33
#Exhibit the average health rate in different agegroup by ggplot
agegroup3_a <- agegroup3 %>%
group_by(agegroup1) %>%
summarise_at(vars(health_rate1), list(health_rate = mean))
#Build the barchart, ranked as descending
agechart <- agegroup3_a %>%
ggplot(aes(x=reorder(agegroup1,-health_rate), y=health_rate))+
geom_bar(stat="identity", fill="darkorchid", alpha=.6, width=.9)+
labs(title="Figure 2.4 Average health rate of respondents by age group")+
xlab("age group") +
theme_bw()
agechart
As can be seen from Figure 2.4, the older the respondents, the lower the average health score.
Interactive Sub-Plot
plotsubs <- subplot(ggplotly(genderchart,tooltip = c("y", "text")),
ggplotly(agechart,tooltip = c("y", "text")),
ggplotly(livingchart,tooltip = c("y", "text")),
ggplotly(countrychart,tooltip = c("y", "text"))
, nrows = 2) %>%
layout(title = 'Health Rate Scoring (Gender, Age Group, Area, Country)',
plot_bgcolor='#e5ecf6',
xaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
yaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'))
plotsubs
# options(browser = 'false')
# api_create(plotsubs, filename = "plotsubs")
# https://chart-studio.plotly.com/~plotlys2022355/58/#/
Map of Health Scores
# 1) Load Map Dataset from Library 2) Left Join Against Share8 Data 3) Filter out unrelated countries
mapdata3 <- map_data("world") %>%
left_join(., country3_a %>% rename(., region = country1), by="region") %>%
filter(!is.na(.$health_rate))
map3 <- ggplot(mapdata3, aes( x = long, y = lat, group=group, text=region)) +
geom_polygon(color = "white",aes(fill = health_rate)) +
scale_fill_gradient(name = "HealthScoring",
low = munsell::mnsl("5Y 7/8"),
high = munsell::mnsl("5G 7/8"),
na.value = "grey50"
)+
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.y=element_blank(),
axis.title.x=element_blank(),
rect = element_rect(fill = "transparent")
)+
theme(
panel.background = element_rect(fill = "transparent", colour = NA_character_),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "transparent",
colour = NA_character_),
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent")
)
MapScore <- ggplotly(map3,tooltip = c("fill","text"))
MapScore
# options(browser = 'false')
# api_create(map3, filename = "healthmap")
# https://chart-studio.plotly.com/~plotlys2022355/60
Other visualization
We seek to understand how covid diagnosis influence Health Sentiment After 3 Months
#Create and Join Additional Tibble for percentage calculation
HealthChange <- data %>% group_by(country,health_change,test_positive) %>% tally() %>% na.omit(.)
HealthChange2 <- HealthChange %>% group_by(country,test_positive) %>%
summarise(Frequency = sum(n))
HealthChange <- left_join(HealthChange, HealthChange2, by=c("country","test_positive"))
#Compute Percentages
HealthChange$percentage <- HealthChange$n / HealthChange$Frequency *100
HealthChange$percentage <- round(HealthChange$percentage, digits = 2)
#Filter irrelevant data point
HealthChange <- HealthChange %>% filter(health_change != '2. About the same')%>% select(country,health_change,test_positive,percentage)
HealthChange <- rename(HealthChange, sentiment='health_change')
HealthChange$pctsort <- HealthChange$percentage
HealthChange$pctsort[HealthChange$sentiment=='1. Improved'] <- 0
#Rename
HealthChange <- HealthChange%>%
mutate(sentiment=recode(sentiment,
"1. Improved"='Improved',
"3. Worsened"='Worsened'))
p <- HealthChange %>%
ggplot(aes(x= percentage, y= reorder(country,pctsort),text=(country), frame = test_positive)) +
geom_line(aes(group = country, frame = test_positive),color="grey")+
geom_point(aes(color=sentiment, frame = test_positive), size=4) +
labs(y="country")+
theme_classic()+
scale_color_brewer(palette = "Dark2")+
labs(title = "Percentage % reporting change in personal health condition", x = "%",
y = "")+
theme(
panel.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing panel outline
panel.grid.major = element_blank(), # get rid of major grid
panel.grid.minor = element_blank(), # get rid of minor grid
plot.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing plot outline
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent")
)
## Warning: Ignoring unknown aesthetics: frame
## Ignoring unknown aesthetics: frame
fig <- ggplotly(p,tooltip = c("x","text")) %>%
animation_opts(
1000, easing = "elastic", redraw = FALSE
) %>%
animation_button(
x = 1, xanchor = "right", y = 0, yanchor = "bottom"
) %>%
animation_slider(
currentvalue = list(prefix = "Tested Positive ", font = list(color="black"))
)
fig
# options(browser = 'false')
# api_create(fig, filename = "HealthConditionSliders")
#https://chart-studio.plotly.com/~plotlys2022355/26
Chart shows % of respondents reporting their health outcomes and whether are they improving or worsening.
It is interesting to note for covid positive resposdents residing in “north-western” european countries,
There are greater proportion of responsdents that reported improving health as compared to worsening.
GEOM TILE HEAT MAP
Alternatively, we also compare what respondents reported during the start of the survey and how are they feeling after 3 months. We subplot between covide diagnosed and non respectively. Result are plot on a tile heatmat.
A <- data[,c('country','gender','agegroup','health_rate','health_change','test_positive')]
Otherdf <- A%>% filter(test_positive!=1) %>%
group_by(health_rate, health_change)%>% tally()%>%
rename(., total='n')%>%na.omit(.)
Otherdf$covidstatus = 'Not Positive'
positivedf <- A%>% filter(test_positive==1)%>%
group_by(health_rate, health_change)%>% tally()%>%
rename(., total='n')%>%na.omit(.)
positivedf$covidstatus = 'Positive'
plottb <- union_all(Otherdf, positivedf)
#Percentage Computation
plottbgroup <- plottb %>% group_by(health_rate,covidstatus) %>%
summarise(Frequency = sum(total))
plottb <- left_join(plottb, plottbgroup, by=c("health_rate","covidstatus"))
plottb$percentage <- plottb$total / plottb$Frequency *100
plottb$percentage <- round(plottb$percentage, digits = 2)
plottb <- plottb %>% filter(health_change != '2. About the same')
# plot
E<- plottb %>%
ggplot(mapping = aes(y = reorder(health_rate, desc(health_rate)), x = health_change, text=paste(percentage,"%"))) +
geom_tile(mapping = aes(fill = percentage)) +
scale_fill_distiller(palette = "Purples", direction = 1) +
theme_light() +
labs(title="% Health progression after 3 Months (Covid Negative vs Positive respondents)",
y="Health reported at start",
x="Health condition after 3 months",
color=NULL)+
theme(
panel.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing panel outline
panel.grid.major = element_blank(), # get rid of major grid
panel.grid.minor = element_blank(), # get rid of minor grid
plot.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing plot outline
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent")
)
F <- E + facet_grid(cols = vars(covidstatus)) + geom_text(aes(label = paste(percentage,"%")))
Eplotly<-ggplotly(F,tooltip = c("x", "text"))
Eplotly
# options(browser = 'false')
# api_create(Eplotly, filename = "HealthChangeSubplots")
#https://chart-studio.plotly.com/~plotlys2022355/40
As expected, covid diagnosis magnifies the impact of worsening health
We can also look at how age and covid both affects health outcomes
A <- data[,c('country','gender','age','health_change','test_positive')]%>%na.omit(.)
HlthChangegp <- A %>% group_by(age, health_change, test_positive) %>% tally()
#% Computation
Hlthgp <- A %>% group_by(age, test_positive) %>% tally()%>% rename(., total='n')
HlthChangegp <- merge(HlthChangegp, Hlthgp, by=c('age','test_positive'), all.x=TRUE)
HlthChangegp$percentage <- HlthChangegp$n/HlthChangegp$total * 100
HlthChangegp$percentage <- round(HlthChangegp$percentage, digits = 2)
#Filter out irrelevant data points and outliers
HlthChangegp <- HlthChangegp %>% filter(health_change != '2. About the same')%>% select(age,health_change,n, total, percentage,test_positive)
HlthChangegp <- HlthChangegp%>% filter(age>=50 & age<90)
# plot
D <- ggplot(HlthChangegp, aes(x=age)) +
geom_line(aes(y=percentage, col=health_change, frame = test_positive)) +
labs(title="% reporting changes in health by age",
# subtitle="Drawn from Long Data format",
# caption="Source: Economics",
y="Percentage %",
color=NULL) +
scale_color_manual(labels = c("1. Improved", "3. Worsened"),
values = c("1. Improved"="#00ba38", "3. Worsened"="#f8766d"))+
theme(
panel.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing panel outline
panel.grid.major = element_blank(), # get rid of major grid
panel.grid.minor = element_blank(), # get rid of minor grid
plot.background = element_rect(fill = "transparent",
colour = NA_character_), # necessary to avoid drawing plot outline
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent")
# ,legend.title=element_text(color="white")
# ,legend.text=element_text(color="white")
# ,plot.title=element_text(color="white")
# ,axis.text.y=element_text(color="white")
# ,axis.text.x=element_text(color="white")
# ,axis.title.y=element_text(color="white")
# ,axis.title.x=element_text(color="white")
) # turn off minor grid
## Warning: Ignoring unknown aesthetics: frame
Dplotly<-ggplotly(D,tooltip = c("x","text","text")) %>%
animation_opts(
1000, easing = "elastic", redraw = FALSE
) %>%
animation_button(
x = 1, xanchor = "right", y = 0, yanchor = "bottom"
) %>%
animation_slider(
currentvalue = list(prefix = "Tested Positive ", font = list(color="black"))
)
Dplotly
# options(browser = 'false')
# api_create(Dplotly, filename = "AgeAnimationDistribution")
#https://chart-studio.plotly.com/~plotlys2022355/24
Naturally, health is going to progression worsen with age. This effect is magnified with covid diagnosis. However, we also noted increase % of improved health outcome for older covid patient. Could this be related to medical facilities in different countries?
Finally we can look at the prevalence of Sequaia against Covid Status
sequelaedf <- data[,c('country','agegroup','test_positive','sequelae1','sequelae2','sequelae3','sequelae4','sequelae5','sequelae6','sequelae7','sequelae8')]
#Assign 0 to NA, in order gather sequalae sample results for participant that did not fill in covid test surveys.
sequelaedf$test_positive[is.na(sequelaedf$test_positive)]=0
# Result for covid positive respondents
sequelaepos <- sequelaedf %>%
filter(test_positive==1) %>%
select(c('country','sequelae1','sequelae2','sequelae3','sequelae4','sequelae5','sequelae6','sequelae7','sequelae8')) %>%
na.omit
# dim(sequelaepos)
#1 Aggregate the occurence of each sequalae by country
#2 Tally number of row by country to act as denominator to compute percentages.
#3 left join both tibbles together
sequelaeposgroup<- left_join(
aggregate(sequelaepos[2:9], by=list(sequelaepos$country), FUN=c('sum')) %>% rename(., country='Group.1'),
sequelaepos %>% count(country, sort = TRUE),
by="country")
#Summarize additional result "sum of all countries together", tibble are union together.
sequelaeposgroup <- union_all(
sequelaeposgroup,
sequelaeposgroup[2:10] %>%
summarise(sequelae1 = sum(sequelae1), sequelae2 = sum(sequelae2), sequelae3 = sum(sequelae3), sequelae4 = sum(sequelae4),
sequelae5 = sum(sequelae5), sequelae6 = sum(sequelae6), sequelae7 = sum(sequelae7), sequelae8 = sum(sequelae8),
n = sum(n)) %>%
mutate(country = 'All')
)
# dim(sequelaeposgroup)
#Similar Approach to the Code Row Above, except this is done for covid negative / other participants
sequelaenotpos <- sequelaedf %>%
filter(test_positive!=1) %>%
select(c('country','sequelae1','sequelae2','sequelae3','sequelae4','sequelae5','sequelae6','sequelae7','sequelae8'))%>%
na.omit
# dim(sequelaenotpos)
sequelaenotposgroup<- left_join(
aggregate(sequelaenotpos[2:9], by=list(sequelaenotpos$country), FUN=c('sum')) %>% rename(., country='Group.1'),
sequelaenotpos %>% count(country, sort = TRUE),
by="country")
sequelaenotposgroup <- union_all(
sequelaenotposgroup,
sequelaenotposgroup[2:10] %>%
summarise(sequelae1 = sum(sequelae1), sequelae2 = sum(sequelae2), sequelae3 = sum(sequelae3), sequelae4 = sum(sequelae4),
sequelae5 = sum(sequelae5), sequelae6 = sum(sequelae6), sequelae7 = sum(sequelae7), sequelae8 = sum(sequelae8),
n = sum(n)) %>%
mutate(country = 'All')
)
# dim(sequelaenotposgroup)
#Union the covid positive and other groupings together
sequelaegroupcombined <- union_all(
sequelaeposgroup%>%mutate(covid_status = 'positive')
,
sequelaenotposgroup%>%mutate(covid_status = 'others')
)
# head(sequelaegroupcombined)
#Compute Percentage Occurence for all 8 sequalae, assign to new columns
for (val in (1:8))
{ calccol = paste('sequelae',val, sep = "")
newcol = paste('s',val, sep = "")
sequelaegroupcombined[newcol] <- round(sequelaegroupcombined[calccol]/sequelaegroupcombined$n*100,1)
}
# dim(sequelaegroupcombined)
# head(sequelaegroupcombined)
sequelaegroupchart <- sequelaegroupcombined %>%
select(country,covid_status,s1,s2,s3,s4,s5,s6,s7,s8) %>%
pivot_longer(., cols =c(s1,s2,s3,s4,s5,s6,s7,s8), names_to = "Sequelae", values_to = "percentage") %>%
mutate(Sequelae=recode(Sequelae, #Recoded for chart interpretation
"s1"='Fatigue',
"s2"='Cough, Shortness of Breath',
"s3"='Loss of Taste/Smell',
"s4"='Headache',
"s5"='Body/JointPains',
"s6"='Chest/Abdominal Pains',
"s7"='Diarrhoea',
"s8"='Confusion')) %>%
rename(., Covid_Status='covid_status')
G<-ggplot(sequelaegroupchart)+
geom_point(aes(x = reorder(Sequelae,percentage), y = percentage, colour = Covid_Status, frame=country, text=paste(Sequelae,": ", percentage,"%")),
position = position_dodge2(width = 1))+
coord_flip() +
theme(axis.text.y = element_text(angle = 45, vjust = 0.5, hjust=1,(size=5)))+
labs(title="Sequelae reported % by Covid diagnosis",
y="%",
x="",
color=NULL)+
scale_color_brewer(palette = "Dark2")+
theme(panel.background = element_rect(fill = "transparent", colour = NA_character_),
panel.grid = element_line(color = "Grey",size = 0.55,linetype = 2),
plot.background = element_rect(fill = "transparent", colour = NA_character_)
)
## Warning: Ignoring unknown aesthetics: frame, text
Gplotly<-ggplotly(G,tooltip = c("colour", "text"))
Gplotly
# options(browser = 'false')
# api_create(Gplotly, filename = "SequelaePercentage")
In General, Covid Positive respondents reported higher occurence of sequelae symptoms as opposed to those who are not diagnosed with covid. We do not that this might not be the case when looking at individual countries.
Modelling
As a whole, the share8 dataset provided contains 318 columns (+2 derived age variable) with 49253 observation. Because of the dependent variable(health_change) with three responses(1. Improved 2. About the same 3. Worsened),we use Multinomial Logistic Regression to build model.
attach(data)
## The following objects are masked from data (pos = 4):
##
## age, agegroup, cac102_, cac103_1, cac103_2, cac103_3, cac103_3b,
## cac103_4, cac103_4b, cac103_5, cac103_5b, cac103_6, cac103_6b,
## cac103_7, cac103_7b, cac103_8, cac103_8b, cac103_97, cac103_97b,
## cac104_, cac105_2, cac105_3, cac105_3b, cac105_4, cac105_4b,
## cac105_5, cac105_5b, cac105_6, cac105_6b, cac105_7, cac105_7b,
## cac105_8, cac105_8b, cac105_97, cac105_97b, cac110_, cac111_2,
## cac111_3, cac111_3b, cac111_4, cac111_4b, cac111_5, cac111_5b,
## cac111_6, cac111_6b, cac111_7, cac111_7b, cac111_8, cac111_8b,
## cac111_97, cac111_97b, cac113_, cac114_2, cac114_3, cac114_3b,
## cac114_4, cac114_4b, cac114_5, cac114_5b, cac114_6, cac114_6b,
## cac114_7, cac114_7b, cac114_8, cac114_8b, cac114_97, cac114_97b,
## cac120_97, cac120_98, cac130_, cac131_, cac140_, cac142_, cac143_,
## caco107_, cadn002_, cae001_, cae100_, cae103_, cae104_1, cae104_2,
## cae104_3, cae104_4, cae104_97, cae105e, cae106_1, cae106_2,
## cae106_3, cae106_4, cae107e, cae108_1, cae108_2, cae108_3,
## cae108_4, cae109_1, cae109_2, cae109_3, cae109_4, cae109_5,
## cae109_6, cae109_7, cae109_8, cae109_98, cae111_, cae112_,
## cae114_1, cae114_2, cae114_3, cae114_4, cae120_, caep005_,
## caep100_, caep101_1, caep101_2, caep102_, caep103_, caf001_,
## caf002_, caf005_, cah004_1, cah004_2, cah004_3, cah004_4, cah004_5,
## cah004_6, cah004_7, cah006_, cah007_1, cah007_2, cah007_3,
## cah007_4, cah007_5, cah007_6, cah007_7, cah017_, cah020_, cah110_,
## cah111_11, cah111_3, cah111_6, cah111_7, cah111_8, cah113_,
## cah116_, cah121_1, cah121_2, cahc117_, cahc118_, cahc119_,
## cahc884_, caho032_, caho100_, caho136_, cait104_, cait105_,
## cait106_3, cait106_4, cait106_5, cait106_6, camh002_, camh007_,
## camh037_, camh113_1, camh113_2, camh118_1, camh118_2, camh148_,
## caph089_1, caph089_2, caph089_3, caph089_4, caph105_, caq105_,
## caq106_1, caq106_2, caq106_3, caq106_4, caq106_97, caq110_,
## caq111_1, caq111_2, caq111_3, caq111_4, caq111_97, caq115_,
## caq116_1, caq116_2, caq116_3, caq116_4, caq116_97, caq118_,
## caq119_, caq120_, caq121_, caq122_, caq123_1, caq123_2, caq123_3,
## caq123_4, caq123_5, caq123_97, caq125_, caq127_, caq128_1,
## caq128_2, caq128_3, caq128_4, caq128_5, caq128_97, caq130_1,
## caq130_2, caq130_3, caq130_4, caq130_97, cas103_1, cas103_2,
## cas103_3, cas103_4, cas103_5, cas104_1, cas104_2, cas104_3,
## cas104_4, cas110_1, cas110_2, cas110_3, cas110_4, cas111_1,
## cas111_2, cas111_3, cas111_4, cas112_1, cas112_2, cas112_3,
## cas112_4, cas113_1, cas113_2, cas113_3, cas113_4, cas115_, cas116_,
## cas120_1, cas120_2, cas120_3, cas120_4, cas121_1, cas121_2,
## cas121_3, cas121_4, cas125_, cas126_, cas127_1, cas127_2, cas127_3,
## cas127_4, cas127_5, cas130_1, cas130_2, cas130_3, cas130_4,
## cas130_5, cas131_1, cas131_2, cas131_3, cas131_4, cas131_5,
## cas140_, casr006_, caw102_, caw103_, caw110_1, caw110_2, caw110_3,
## caw111_, caw117_, caw121_, caw122_, caw123_1, caw123_2, caw123_3,
## caw123_4, caw124_, caw125_, caw126_1, caw126_2, caw126_3, caw126_4,
## country, coupleid9ca, drugs_taken, exrate, gender, health_change,
## health_rate, hhid9ca, hospitalized, language_ca, living_area,
## mergeid, mergeidp9ca, sequelae1, sequelae2, sequelae3, sequelae4,
## sequelae5, sequelae6, sequelae7, sequelae8, test_positive, yearborn
#select variables
variable<-data.frame(health_change,sequelae1,sequelae2,sequelae3,sequelae4,sequelae5,sequelae6,sequelae7,sequelae8)
#check missing value
head(is.na(variable))
## health_change sequelae1 sequelae2 sequelae3 sequelae4 sequelae5 sequelae6
## [1,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## sequelae7 sequelae8
## [1,] TRUE TRUE
## [2,] TRUE TRUE
## [3,] TRUE TRUE
## [4,] TRUE TRUE
## [5,] TRUE TRUE
## [6,] TRUE TRUE
#drop missing values
used<-na.omit(variable)
#check again
head(is.na(used))
## health_change sequelae1 sequelae2 sequelae3 sequelae4 sequelae5 sequelae6
## 8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 44 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 45 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 69 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## sequelae7 sequelae8
## 8 FALSE FALSE
## 9 FALSE FALSE
## 24 FALSE FALSE
## 44 FALSE FALSE
## 45 FALSE FALSE
## 69 FALSE FALSE
#change the value
used <- used %>%
mutate(health_change=recode(health_change,
"1. Improved"=2,
"2. About the same"=1,
"3. Worsened"=0))
The values for individual sequelae were already recoded to 1s and 0s earlier during data cleaning, hence it is not needed to be handled on this step.
str(used)
## 'data.frame': 4071 obs. of 9 variables:
## $ health_change: num 1 1 1 1 1 1 1 0 1 1 ...
## $ sequelae1 : num 0 0 0 0 0 1 1 1 0 0 ...
## $ sequelae2 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ sequelae3 : num 1 1 0 0 0 0 0 0 0 1 ...
## $ sequelae4 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ sequelae5 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ sequelae6 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sequelae7 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sequelae8 : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:45182] 1 2 3 4 5 6 7 10 11 12 ...
## ..- attr(*, "names")= chr [1:45182] "1" "2" "3" "4" ...
#change the data type
used[ colnames(used) ] <- lapply(used[colnames(used) ], factor)
#check the change successfully or not
str(used)
## 'data.frame': 4071 obs. of 9 variables:
## $ health_change: Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 1 2 2 ...
## $ sequelae1 : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 2 1 1 ...
## $ sequelae2 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ sequelae3 : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 2 ...
## $ sequelae4 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ sequelae5 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ sequelae6 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sequelae7 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sequelae8 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:45182] 1 2 3 4 5 6 7 10 11 12 ...
## ..- attr(*, "names")= chr [1:45182] "1" "2" "3" "4" ...
#Load the multinom package
library(nnet)
#Since we are going to use health_change="3. Worsened" as the reference group, we need relevel the group.
typeof(used$health_change)
## [1] "integer"
levels(used$health_change)
## [1] "0" "1" "2"
#define new variable and make the reference
used$newhealth_change<-relevel(used$health_change,ref = 1)
#include nothing
null_model<-multinom(used$newhealth_change~1)
## # weights: 6 (2 variable)
## initial value 4472.450627
## iter 10 value 3469.508515
## iter 10 value 3469.508513
## iter 10 value 3469.508513
## final value 3469.508513
## converged
summary(null_model)
## Call:
## multinom(formula = used$newhealth_change ~ 1)
##
## Coefficients:
## (Intercept)
## 1 1.1610967
## 2 -0.5462245
##
## Std. Errors:
## (Intercept)
## 1 0.0392357
## 2 0.0565387
##
## Residual Deviance: 6939.017
## AIC: 6943.017
#include all independent variables
full_model<-multinom(used$newhealth_change~used$sequelae1+used$sequelae2+used$sequelae3+used$sequelae4+used$sequelae5+used$sequelae6+used$sequelae7+used$sequelae8)
## # weights: 30 (18 variable)
## initial value 4472.450627
## iter 10 value 3362.484693
## iter 20 value 3325.982340
## final value 3324.788672
## converged
summary(full_model)
## Call:
## multinom(formula = used$newhealth_change ~ used$sequelae1 + used$sequelae2 +
## used$sequelae3 + used$sequelae4 + used$sequelae5 + used$sequelae6 +
## used$sequelae7 + used$sequelae8)
##
## Coefficients:
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 1.7029409 -0.47844972 -0.3240059 0.3943957 -0.1666516
## 2 -0.4382988 0.08900326 -0.2559248 0.2998691 -0.6068302
## used$sequelae51 used$sequelae61 used$sequelae71 used$sequelae81
## 1 -0.1539485 -0.4322652 -0.25862549 -0.7510225
## 2 0.1839554 -0.1678917 0.02382543 0.0435898
##
## Std. Errors:
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 0.06755890 0.09670772 0.09239625 0.09708501 0.1020431
## 2 0.09897365 0.13838066 0.13048941 0.13349540 0.1460793
## used$sequelae51 used$sequelae61 used$sequelae71 used$sequelae81
## 1 0.1020096 0.1191702 0.1334782 0.1416943
## 2 0.1419204 0.1643516 0.1773703 0.1775956
##
## Residual Deviance: 6649.577
## AIC: 6685.577
#three methods of selecting variable
step(full_model,direction = "both",trace = 0)
## trying - used$sequelae1
## trying - used$sequelae2
## trying - used$sequelae3
## trying - used$sequelae4
## trying - used$sequelae5
## trying - used$sequelae6
## trying - used$sequelae7
## trying - used$sequelae8
## Call:
## multinom(formula = used$newhealth_change ~ used$sequelae1 + used$sequelae2 +
## used$sequelae3 + used$sequelae4 + used$sequelae5 + used$sequelae6 +
## used$sequelae7 + used$sequelae8)
##
## Coefficients:
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 1.7029409 -0.47844972 -0.3240059 0.3943957 -0.1666516
## 2 -0.4382988 0.08900326 -0.2559248 0.2998691 -0.6068302
## used$sequelae51 used$sequelae61 used$sequelae71 used$sequelae81
## 1 -0.1539485 -0.4322652 -0.25862549 -0.7510225
## 2 0.1839554 -0.1678917 0.02382543 0.0435898
##
## Residual Deviance: 6649.577
## AIC: 6685.577
step(full_model,direction = "forward",trace = 0)
## Call:
## multinom(formula = used$newhealth_change ~ used$sequelae1 + used$sequelae2 +
## used$sequelae3 + used$sequelae4 + used$sequelae5 + used$sequelae6 +
## used$sequelae7 + used$sequelae8)
##
## Coefficients:
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 1.7029409 -0.47844972 -0.3240059 0.3943957 -0.1666516
## 2 -0.4382988 0.08900326 -0.2559248 0.2998691 -0.6068302
## used$sequelae51 used$sequelae61 used$sequelae71 used$sequelae81
## 1 -0.1539485 -0.4322652 -0.25862549 -0.7510225
## 2 0.1839554 -0.1678917 0.02382543 0.0435898
##
## Residual Deviance: 6649.577
## AIC: 6685.577
step(full_model,direction = "backward",trace = 0)
## trying - used$sequelae1
## trying - used$sequelae2
## trying - used$sequelae3
## trying - used$sequelae4
## trying - used$sequelae5
## trying - used$sequelae6
## trying - used$sequelae7
## trying - used$sequelae8
## Call:
## multinom(formula = used$newhealth_change ~ used$sequelae1 + used$sequelae2 +
## used$sequelae3 + used$sequelae4 + used$sequelae5 + used$sequelae6 +
## used$sequelae7 + used$sequelae8)
##
## Coefficients:
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 1.7029409 -0.47844972 -0.3240059 0.3943957 -0.1666516
## 2 -0.4382988 0.08900326 -0.2559248 0.2998691 -0.6068302
## used$sequelae51 used$sequelae61 used$sequelae71 used$sequelae81
## 1 -0.1539485 -0.4322652 -0.25862549 -0.7510225
## 2 0.1839554 -0.1678917 0.02382543 0.0435898
##
## Residual Deviance: 6649.577
## AIC: 6685.577
#Check the Z-score (wald Z)
z <- summary(full_model)$coefficients/summary(full_model)$standard.errors
z
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 25.206759 -4.947379 -3.506700 4.062375 -1.633150
## 2 -4.428439 0.643177 -1.961269 2.246288 -4.154116
## used$sequelae51 used$sequelae61 used$sequelae71 used$sequelae81
## 1 -1.509157 -3.627294 -1.9375854 -5.3003006
## 2 1.296187 -1.021540 0.1343259 0.2454441
#2-tailed z test
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 0.00000e+00 7.521952e-07 0.0004537001 4.857601e-05 1.024375e-01
## 2 9.49175e-06 5.201092e-01 0.0498477057 2.468557e-02 3.265482e-05
## used$sequelae51 used$sequelae61 used$sequelae71 used$sequelae81
## 1 0.1312586 0.0002864071 0.05267383 1.156122e-07
## 2 0.1949111 0.3069987307 0.89314485 8.061126e-01
#sequelae5 and sequelae7 are not significant at 5% in any level,try to remove them.
model1<-multinom(used$newhealth_change~used$sequelae1+used$sequelae2+used$sequelae3+used$sequelae4+used$sequelae6+used$sequelae8)
## # weights: 24 (14 variable)
## initial value 4472.450627
## iter 10 value 3362.004536
## iter 20 value 3331.830755
## final value 3331.830674
## converged
summary(model1)
## Call:
## multinom(formula = used$newhealth_change ~ used$sequelae1 + used$sequelae2 +
## used$sequelae3 + used$sequelae4 + used$sequelae6 + used$sequelae8)
##
## Coefficients:
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 1.6910659 -0.5194292 -0.3362381 0.3604355 -0.2184148
## 2 -0.4265207 0.1392990 -0.2432110 0.3229901 -0.5672804
## used$sequelae61 used$sequelae81
## 1 -0.4955024 -0.81367586
## 2 -0.1279803 0.06699625
##
## Std. Errors:
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 0.06722951 0.0938762 0.09207328 0.09587301 0.09913518
## 2 0.09838977 0.1337368 0.13009589 0.13192214 0.14249512
## used$sequelae61 used$sequelae81
## 1 0.1162881 0.1392235
## 2 0.1603715 0.1742403
##
## Residual Deviance: 6663.661
## AIC: 6691.661
z1<- summary(model1)$coefficients/summary(model1)$standard.errors
z1
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 25.153624 -5.533129 -3.651853 3.759509 -2.203201
## 2 -4.335011 1.041591 -1.869475 2.448339 -3.981051
## used$sequelae61 used$sequelae81
## 1 -4.2609911 -5.8443877
## 2 -0.7980241 0.3845048
p1<- (1 - pnorm(abs(z1), 0, 1)) * 2
p1
## (Intercept) used$sequelae11 used$sequelae21 used$sequelae31 used$sequelae41
## 1 0.000000e+00 3.145677e-08 0.0002603549 0.0001702468 2.758055e-02
## 2 1.457531e-05 2.976013e-01 0.0615567469 0.0143516534 6.861113e-05
## used$sequelae61 used$sequelae81
## 1 2.035223e-05 5.084349e-09
## 2 4.248565e-01 7.006043e-01
Interpretation
We build null_model at first as reference, and then we build full_model. The AIC decreases from 6943.017 to 6685.577, which means the full_model is better than null_model. The three methods of selecting variables to build the best model tell us the same results. The outputs of “both”“forward”“backward” are full_model with the lowest AIC. g1(x)=ln[p(Y=1|X)/p(Y=0|x)]=1.70-0.48sequelae1-0.32sequelae2+0.39sequelae3-0.17sequelae4-0.15sequelae5-0.43sequelae6-0.26sequelae7-0.75sequelae8 g2(x)=ln[p(Y=2|X)/p(Y=0|x)]=-0.44+0.09sequelae1-0.26sequelae2+0.30sequelae3-0.61sequelae4+0.18sequelae5-0.17sequelae6+0.02sequelae7+0.04sequelae8 The meaning of coefficient is that for example β11 represents x1 changes 1 unit, odds ratio[p(Y=1|X)/p(Y=0|x)] changes e(-0.48) times, which means x1 increases, odds ratio decreases because the value of e(-0.48) is between 0 and 1. The research aims to find out which sequelae can significantly impact physical health in infected individuals. From the p-value table, we can see sequelae5 and sequelae7 are not significant at 5%. We try to remove them to build model1. All independent variables are significant in model1, though the p-value of sequelae 2 in g2(x) is little more than 0.05, which is 0.06, it can be accepted. Overall, body aches, joint pain attributed to respondent’s Covid illness(sequelae5) and diarrhoea, nausea attributed to respondent’s Covid illness(sequelae7) can’t significantly impact physical health in infected individuals.
Resampling
Resampling methods and show how it improve your model performance Usually, the bootstrap method process is: 1. specify the number of samples “k”. obtain a new sample of the same size as the original sample by resampling from the original sample. 2. estimate the intercept term and slop of the samples “k”.
#load the boot package
library(boot)
#check the dataset
str(used)
## 'data.frame': 4071 obs. of 10 variables:
## $ health_change : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 1 2 2 ...
## $ sequelae1 : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 2 1 1 ...
## $ sequelae2 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ sequelae3 : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 2 ...
## $ sequelae4 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ sequelae5 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ sequelae6 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sequelae7 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sequelae8 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ newhealth_change: Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 1 2 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:45182] 1 2 3 4 5 6 7 10 11 12 ...
## ..- attr(*, "names")= chr [1:45182] "1" "2" "3" "4" ...
summary(used)
## health_change sequelae1 sequelae2 sequelae3 sequelae4 sequelae5 sequelae6
## 0: 853 0:2016 0:2593 0:2804 0:2910 0:2727 0:3478
## 1:2724 1:2055 1:1478 1:1267 1:1161 1:1344 1: 593
## 2: 494
## sequelae7 sequelae8 newhealth_change
## 0:3667 0:3732 0: 853
## 1: 404 1: 339 1:2724
## 2: 494
head(used,10)
## health_change sequelae1 sequelae2 sequelae3 sequelae4 sequelae5 sequelae6
## 8 1 0 0 1 0 0 0
## 9 1 0 0 1 0 0 0
## 24 1 0 0 0 0 0 0
## 44 1 0 0 0 0 0 0
## 45 1 0 0 0 0 0 0
## 69 1 1 0 0 0 0 0
## 70 1 1 0 0 0 0 0
## 102 0 1 0 0 0 0 0
## 112 1 0 1 0 1 1 0
## 180 1 0 0 1 0 0 0
## sequelae7 sequelae8 newhealth_change
## 8 0 0 1
## 9 0 0 1
## 24 0 0 1
## 44 0 0 1
## 45 0 0 1
## 69 0 0 1
## 70 0 0 1
## 102 0 0 0
## 112 0 0 1
## 180 0 0 1
#check the dataset's type
lapply(used,class)
## $health_change
## [1] "factor"
##
## $sequelae1
## [1] "factor"
##
## $sequelae2
## [1] "factor"
##
## $sequelae3
## [1] "factor"
##
## $sequelae4
## [1] "factor"
##
## $sequelae5
## [1] "factor"
##
## $sequelae6
## [1] "factor"
##
## $sequelae7
## [1] "factor"
##
## $sequelae8
## [1] "factor"
##
## $newhealth_change
## [1] "factor"
#factor change into numeric
used$sequelae1<-as.numeric(as.character(used$sequelae1))
used$sequelae2<-as.numeric(as.character(used$sequelae2))
used$sequelae3<-as.numeric(as.character(used$sequelae3))
used$sequelae4<-as.numeric(as.character(used$sequelae4))
used$sequelae6<-as.numeric(as.character(used$sequelae6))
used$sequelae8<-as.numeric(as.character(used$sequelae8))
#check again
lapply(used,class)
## $health_change
## [1] "factor"
##
## $sequelae1
## [1] "numeric"
##
## $sequelae2
## [1] "numeric"
##
## $sequelae3
## [1] "numeric"
##
## $sequelae4
## [1] "numeric"
##
## $sequelae5
## [1] "factor"
##
## $sequelae6
## [1] "numeric"
##
## $sequelae7
## [1] "factor"
##
## $sequelae8
## [1] "numeric"
##
## $newhealth_change
## [1] "factor"
we need to create a simple function, and put the dataset into the function.
#create a function
boot.f <- function(used, index){
fit <- glm(used$newhealth_change~used$sequelae1+used$sequelae2+used$sequelae3+used$sequelae4+used$sequelae6+used$sequelae8, family = binomial(), data = used, subset = index)
return( coef(fit))
}
#Use boot()to calculate 1000times’ interception and slops
set.seed(1)
boot(used,boot.f,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = used, statistic = boot.f, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 1.8040229 0.001371493 0.06833692
## t2* -0.4200263 -0.003586279 0.09084659
## t3* -0.3197376 0.004802877 0.09082546
## t4* 0.3531647 0.001189192 0.09415876
## t5* -0.2846220 -0.001685399 0.09499409
## t6* -0.4161676 0.002531271 0.11781765
## t7* -0.5881859 -0.000680805 0.12796815
fitting by sampling from observations with putbacks.
summary(glm(used$newhealth_change~used$sequelae1+used$sequelae2+used$sequelae3+used$sequelae4+used$sequelae6+used$sequelae8,family = binomial(),data = used))$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8040229 0.06649776 27.129079 4.470860e-162
## used$sequelae1 -0.4200263 0.09212672 -4.559224 5.134302e-06
## used$sequelae2 -0.3197376 0.08983161 -3.559299 3.718457e-04
## used$sequelae3 0.3531647 0.09339101 3.781570 1.558421e-04
## used$sequelae4 -0.2846220 0.09613040 -2.960791 3.068502e-03
## used$sequelae6 -0.4161676 0.11136556 -3.736950 1.862657e-04
## used$sequelae8 -0.5881859 0.12910410 -4.555904 5.216073e-06
Interpretation estimate SE SE(resampling) 1.8040229 0.06833692 0.06649776 -0.4200263 0.09084659 0.09212672 -0.3197376 0.09082546 0.08983161 0.3531647 0.09415876 0.09339101 -0.2846220 0.09499409 0.09613040 -0.4161676 0.11781765 0.11136556 -0.5881859 0.12796815 0.12910410 compare SE and SE(resampling), The numbers obtained are very close and we can say that the model we built performs very well.
Conclusion
The research aims to find out which sequelae can significantly impact physical health in infected individuals. By modeling multinomial logistic regression in the study, it was concluded that sequelae5(body aches, joint pain attributed to respondent’s Covid illness) and sequelae7(diarrhea, nausea attributed to respondent’s Covid illness) were not significant to Dependent variable (change in health in the last 3 months. And the other sequelae 1,2,3,4,6,8 were significant to dependent variable.
At the same time, the sequelae of covid-19 also show in the EDA those different countries, genders, ages, etc., will have different health situation. The big city respondents were more likely to feel healthy. The more developed countries, both men and women, will pay more attention to health. At the same time, the more developed countries have better medical standards, higher people’s happiness index, happy mood, and comfortable life, which directly affect the health level. Similarly, women’s health level is higher than that of men, because women’s living habits are better than men. In short, when faced with the advent of Covid-19, different physical conditions will also reflect different consequences.
As new variants of SARS-CoV-2 emerge, these viral strains continue to cause long-term complications. One variant may cause more damaging long-term effects than other strains, and infected individuals could require additional support and more rapid and intense treatment. This makes managing the aftereffects of COVID-19 infection even more difficult for healthcare providers in the next stage of the pandemic. Continued monitoring of these individuals is necessary to comprehend the extent and severity of these long-term symptoms. A greater understanding of the pathogenesis and methods for treating long COVID is necessary to decrease the burden. Post-hospital care of COVID-19 survivors has become a significant research priority, and adequate guidelines for the management of these patients are still being developed (Sanyaolu et al,2022). Always take precautions until the COVID-19 pandemic is over. The number of people with long-term symptoms of Covid-19 appears to be decreasing over time. But the new coronavirus only emerged at the end of 2019 and began a global pandemic the following year, so there is a lack of long-term data studies. Even if Covid-19 patients appear to be recovering now, they may be at risk for life. At the same time, with the passage of time, the amount of data will expand, and it may become clearer in future research.